I plan to use a combination of plots: histograms, curve and rug plots, scatter/size plots, and finally a scatter/size plot with interactivity.
A histogram visualizes a single continuous variable. Each bar represents the number of observations that fall within a specified bin of values. Histograms are often used to view the distribution of a variable and to check for gaps, skewness, or normality.
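To make the idea of binning concrete, here is a minimal sketch in plain Python of the counting step behind a histogram. The function name `bin_counts` and the tempo values are made up for illustration; plotting libraries choose bins in their own way.

```python
from collections import Counter

def bin_counts(values, bin_width, start=0):
    """Count how many values fall in each half-open bin [left, left + bin_width)."""
    counts = Counter()
    for v in values:
        index = int((v - start) // bin_width)   # which bin the value lands in
        left_edge = start + index * bin_width   # label the bin by its left edge
        counts[left_edge] += 1
    return dict(sorted(counts.items()))

# Hypothetical tempo values (BPM), invented for this example
tempos = [92, 96, 125, 130, 138, 139, 143, 144, 170, 175]
print(bin_counts(tempos, bin_width=20, start=80))
# {80: 2, 120: 4, 140: 2, 160: 2}
```

Note that the empty bin starting at 100 does not appear in the result. In a drawn histogram it would show up as a gap, which is one of the things a histogram helps reveal.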
A curve and rug plot pairs a smoothed distribution curve with a rug plot. A rug plot is often used in conjunction with other types of plots, such as histograms or density plots, to provide additional information about the distribution of data points along a single axis. It consists of a single axis, typically the x-axis or y-axis, where each data point is represented by a short line or "tick" along the axis.
A scatter plot visualizes the relationship between two continuous variables on a graph where each dot marks one observation's pair of values. Scatter plots are often used for identifying patterns such as trends, clusters, or correlations between two variables. Additionally, they can be overlaid with a line of fit to check whether there is a linear correlation between the variables.
A size plot is the same as a scatter plot, except that it uses a third variable to determine the size of each dot. In the plots in this notebook, size follows a numeric variable (stream count) while color follows a categorical one. This allows further analysis of the scatter plot across the third variable to help identify patterns or differences in the relationships among three variables.
Interactive graphs let users explore and interact with data visualizations dynamically, in real time, and customize the visualization according to their preferences. This could include options for filtering data, adjusting parameters, or switching between different views or representations of the data.
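As a rough illustration of what a "line of fit" is, the sketch below computes an ordinary least squares line by hand in plain Python. The helper `fit_line` and the sample points are hypothetical. This is not how Plotly computes its trendlines internally, just the textbook formula.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical points that fall exactly on y = -2x + 12
xs = [1, 2, 3, 4, 5]
ys = [10, 8, 6, 4, 2]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # -2.0 12.0
```

A negative slope means the second variable tends to fall as the first rises. A slope near zero suggests little linear relationship.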
Histograms are close to bar plots but are used for representing and understanding the frequency distribution of continuous variables. Alternatives to histograms include density plots or kernel density estimation (KDE) plots, which provide a smoothed estimate of the probability density function. Histograms are designed for continuous data and may not be appropriate for categorical or ordinal data without some form of transformation.
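To show what "a smoothed estimate of the probability density function" means, here is a small hand-rolled Gaussian KDE in plain Python. The function `gaussian_kde`, the sample values, and the bandwidth of 5 are all assumptions chosen for illustration. Real libraries pick bandwidths with their own rules.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates density as an average of Gaussian bumps,
    one bump centred on each sample."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

# Hypothetical danceability percentages, invented for this example
samples = [51, 55, 60, 65, 65, 71, 80]
density = gaussian_kde(samples, bandwidth=5.0)
for x in (40, 65, 90):
    print(x, density(x))
```

The estimate is highest where the samples cluster (around 65 here) and falls off toward the edges. A smaller bandwidth gives a bumpier curve, and a larger one gives a smoother curve.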
Curve plots, often overlaid on histograms or density plots, represent a smoothed version of the underlying distribution. Rug plots show individual data points along the x-axis. Together they provide additional information about the distribution of data, such as whether it is multimodal or skewed, and how densely individual data points are packed along the distribution. Instead of curve plots, you could use cumulative distribution functions (CDFs) to visualize the cumulative distribution of the data. Curve plots rely on choices about how the curve is fitted: a normal curve assumes the data are roughly normal, and a KDE curve depends on its bandwidth, so either can misrepresent data that do not suit those choices.
Scatter plots are close to line plots but are used for visualizing the relationship between two continuous variables. Alternatives to scatter plots include bubble charts, which can incorporate a third dimension by varying the size of markers based on a third variable. Scatter plots help identify patterns, trends, correlations, and outliers within bivariate data. They are useful for exploring relationships between variables and identifying any underlying structure or clusters. When the relationship between variables is non-linear, a scatter plot with a straight line of fit may not capture the underlying pattern, and other visualization methods may be more suitable.
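For comparison, an empirical CDF needs no smoothing choices at all. The sketch below builds one in plain Python. The helper name `ecdf` and the values are hypothetical. Plotly Express also provides `px.ecdf` for drawing this kind of curve, though I do not use it in this notebook.

```python
from bisect import bisect_right

def ecdf(values):
    """Return F, where F(x) is the fraction of observations less than or equal to x."""
    ordered = sorted(values)
    n = len(ordered)

    def F(x):
        return bisect_right(ordered, x) / n

    return F

# Hypothetical speechiness percentages, invented for this example
F = ecdf([4, 6, 6, 8, 15])
print(F(3), F(6), F(15))  # 0.0 0.6 1.0
```

Reading the result: 60% of these made-up values are at or below 6, which is the kind of statement a CDF plot lets you read straight off the y-axis.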
Size plots are close to scatter plots but allow for the visualization of a third dimension of data by varying the size of markers (bubbles) based on a third variable. They are particularly useful when you want to represent three dimensions of data on a two-dimensional plot.
Interactive plots are similar to static plots in that they both visually represent data, but interactive plots offer additional functionality and user interaction that static plots do not. It is important to keep in mind that interactive plots may not be accessible to all users, especially those with disabilities. Interactivity can also lead to overly complex visualizations that confuse or overwhelm users. It's important to consider the target audience and the specific goals of the visualization to determine whether interactivity adds value or detracts from the message.
Plotly is an interactive graphing library that allows users to create interactive plots and dashboards in Python, R, and other programming languages. It was created by a company called Plotly Technologies Inc., which was founded by Alex Johnson, Jack Parmer, and Chris Parmer in 2012.
Plotly offers both open-source and commercial versions of its software. The open-source version, Plotly.py, is available under the MIT license, allowing users to use, modify, and distribute the library freely.
To install Plotly.py, you can use pip, the Python package manager. Here's how to install it in our Jupyter Notebook:
pip install plotly
Requirement already satisfied: plotly in /opt/conda/lib/python3.9/site-packages (5.19.0)
Requirement already satisfied: tenacity>=6.2.0 in /opt/conda/lib/python3.9/site-packages (from plotly) (8.2.3)
Requirement already satisfied: packaging in /opt/conda/lib/python3.9/site-packages (from plotly) (23.1)
Note: you may need to restart the kernel to use updated packages.
Once installed, you can import Plotly.py into your Python scripts or Jupyter notebooks and start creating interactive visualizations using its APIs.
For more detailed installation instructions and usage examples, you can refer to the Plotly.py documentation: https://plotly.com/python/getting-started/
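If you want to confirm which version ended up installed without importing the library, one option is the standard library's `importlib.metadata`. This is just a small sketch, and the helper name `installed_version` is my own.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

# Prints a version string if Plotly is installed in this environment, otherwise None
print(installed_version("plotly"))
```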
#Let's import a Plotly package now to get started
import plotly.express as px
import plotly.offline
#Enable Plotly rendering inside the notebook
plotly.offline.init_notebook_mode()
Plotly is primarily declarative in nature; however, it can also be used procedurally in certain contexts. With Plotly, you define your visualization by creating a figure object and specifying its properties such as data, layout, and styling. The library then takes care of the underlying implementation details to render the visualization. Users can still manipulate plots and figures programmatically using procedural techniques, especially when creating dynamic or complex visualizations.
Plotly integrates very easily with Jupyter Notebook environments, which is why it is a great choice for visualizing data in this assignment. Plotly's interactive plots render directly within Jupyter Notebook cells, allowing for easy exploration and presentation of data without the need to export plots to external files. This also makes it popular with data scientists who work in Jupyter, as it enables them to create and share interactive visualizations as part of their data analysis workflows.
While Plotly is generally efficient, it may struggle with large datasets or complex visualizations, especially when rendering interactive plots with a high number of data points. This can lead to slower loading times or performance issues, particularly when deploying visualizations in web applications.
The dataset I am using for this demonstration is "Most Streamed Spotify Songs 2023", available on Kaggle: https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023
This dataset contains a comprehensive list of the most famous songs of 2023 as listed on Spotify. It provides insights into each song's attributes, popularity, and presence on various music platforms. The dataset includes information such as track name, artist(s) name, release date, Spotify playlists and charts, streaming statistics, Apple Music presence, Deezer presence, Shazam charts, and various audio features (Rajat Raj via Kaggle).
My goal with this dataset is to conduct an exploratory analysis of the trends across songs that were popular in 2023. The results could then serve as a starting point for analyzing other years or data points.
Let's begin!
#Using pandas, import Spotify data as a dataframe
import pandas as pd
data = pd.read_csv('spotify-2023.csv')
#And now a call to preview what we're working with
data.head()
| | track_name | artist(s)_name | artist_count | released_year | released_month | released_day | in_spotify_playlists | in_spotify_charts | streams | in_apple_playlists | ... | bpm | key | mode | danceability_% | valence_% | energy_% | acousticness_% | instrumentalness_% | liveness_% | speechiness_% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Seven (feat. Latto) (Explicit Ver.) | Latto, Jung Kook | 2 | 2023 | 7 | 14 | 553 | 147 | 141381703 | 43 | ... | 125 | B | Major | 80 | 89 | 83 | 31 | 0 | 8 | 4 |
| 1 | LALA | Myke Towers | 1 | 2023 | 3 | 23 | 1474 | 48 | 133716286 | 48 | ... | 92 | C# | Major | 71 | 61 | 74 | 7 | 0 | 10 | 4 |
| 2 | vampire | Olivia Rodrigo | 1 | 2023 | 6 | 30 | 1397 | 113 | 140003974 | 94 | ... | 138 | F | Major | 51 | 32 | 53 | 17 | 0 | 31 | 6 |
| 3 | Cruel Summer | Taylor Swift | 1 | 2019 | 8 | 23 | 7858 | 100 | 800840817 | 116 | ... | 170 | A | Major | 55 | 58 | 72 | 11 | 0 | 11 | 15 |
| 4 | WHERE SHE GOES | Bad Bunny | 1 | 2023 | 5 | 18 | 3133 | 50 | 303236322 | 84 | ... | 144 | A | Minor | 65 | 23 | 80 | 14 | 63 | 11 | 6 |
5 rows × 24 columns
Let's get our data prepped for plotting first by cleaning up some of the values.
def cleaning(df):
    #Change the value of streams to a number
    df['streams'] = pd.to_numeric(df['streams'], errors='coerce')
    #Rename date columns and sort the songs by the date they were released
    df = df.rename(columns={"released_year": "year", "released_month": "month", "released_day": "day"})
    df = df.sort_values(by=['year', 'month', 'day'], ascending=True)
    #Get rid of any NaN values
    df = df.dropna()
    return df
data = cleaning(data)
data.head()
| | track_name | artist(s)_name | artist_count | year | month | day | in_spotify_playlists | in_spotify_charts | streams | in_apple_playlists | ... | bpm | key | mode | danceability_% | valence_% | energy_% | acousticness_% | instrumentalness_% | liveness_% | speechiness_% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 439 | Agudo MÔøΩÔøΩgi | Styrx, utku INC, Thezth | 3 | 1930 | 1 | 1 | 323 | 0 | 90598517.0 | 4 | ... | 130 | F# | Minor | 65 | 49 | 80 | 22 | 4 | 7 | 5 |
| 469 | White Christmas | Bing Crosby, John Scott Trotter & His Orchestr... | 3 | 1942 | 1 | 1 | 11940 | 0 | 395591396.0 | 73 | ... | 96 | A | Major | 23 | 19 | 25 | 91 | 0 | 40 | 3 |
| 460 | The Christmas Song (Merry Christmas To You) - ... | Nat King Cole | 1 | 1946 | 11 | 1 | 11500 | 0 | 389771964.0 | 140 | ... | 139 | C# | Major | 36 | 22 | 15 | 84 | 0 | 11 | 4 |
| 466 | Let It Snow! Let It Snow! Let It Snow! | Frank Sinatra, B. Swanson Quartet | 2 | 1950 | 1 | 1 | 10585 | 0 | 473248298.0 | 126 | ... | 143 | D | Major | 60 | 86 | 32 | 88 | 0 | 34 | 6 |
| 496 | Jingle Bells - Remastered 1999 | Frank Sinatra | 1 | 1957 | 1 | 1 | 4326 | 0 | 178660459.0 | 32 | ... | 175 | G# | Major | 51 | 94 | 34 | 73 | 0 | 10 | 5 |
5 rows × 24 columns
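To see what `errors='coerce'` followed by `dropna()` actually does, here is a tiny stand-alone sketch on a made-up frame. The `toy` frame and its bad entry are invented. It also shows why `streams` now prints with a trailing `.0`: coercion introduces NaN, which makes the column floating point.

```python
import pandas as pd

# Hypothetical miniature of the data, with one entry that is not a number
toy = pd.DataFrame({
    "track_name": ["a", "b", "c"],
    "streams": ["141381703", "not-a-number", "800840817"],
})

# Entries that cannot be parsed become NaN instead of raising an error
toy["streams"] = pd.to_numeric(toy["streams"], errors="coerce")
print(toy.isna().sum())   # missing values per column, worth checking before dropping

# dropna() removes every row that has a NaN in any column
toy = toy.dropna()
print(len(toy), toy["streams"].dtype)
```

Because `dropna()` looks at every column, rows with a missing value anywhere are removed, not only rows where `streams` failed to parse.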
First, I am curious to see which song keys were most popular in 2023. To do this I will use Plotly's px.histogram to create a simple histogram showing the most popular song keys of the year by their counts.
def histogram(df):
    #Use px.histogram to plot the histogram
    fig = px.histogram(df,
                       x="key",
                       color="mode", #use color to split each bar by mode (stacked by default)
                       hover_data=df.columns #add hover labels to bars
                       )
    #Update titles and formatting
    fig.update_layout(title={'text': "Most Popular Song Keys of 2023",
                             'xanchor': 'center',
                             'yanchor': 'top',
                             'y':0.9,
                             'x':0.5
                             },
                      xaxis_title="Key",
                      yaxis_title="Number of Songs"
                      )
    return fig
histogram(data)
Interesting! Our histogram has revealed that C# major was the most popular key of 2023.
Let's plot a more complex histogram to see what underlying trends there might be across popular song attributes. For this, we will drop the bars and use a line to view each attribute's distribution or "curve" and include an accompanying rug plot below.
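A reading like this can be double-checked numerically with pandas. The sketch below uses a small made-up frame, so its counts are not the real ones. On the notebook's data, the same two calls would be applied to `data` instead of `toy`.

```python
import pandas as pd

# Hypothetical keys and modes, invented for this example
toy = pd.DataFrame({
    "key":  ["C#", "C#", "C#", "F", "F", "A"],
    "mode": ["Major", "Major", "Minor", "Major", "Minor", "Minor"],
})

# Songs per key: the total height of each bar
key_counts = toy["key"].value_counts()

# Songs per key and mode: the height of each colored segment
key_mode_counts = toy.groupby(["key", "mode"]).size()

print(key_counts.idxmax())       # most common key
print(key_mode_counts.idxmax())  # most common (key, mode) pair
```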
def curve_and_rug(df):
    #to plot a layered distribution, we will need to import figure factory first
    import plotly.figure_factory as ff
    #figure factory's distribution plot requires that we include histogram data as a list of lists
    hist_data = [df['danceability_%'], df['energy_%'], df['valence_%'],
                 df['liveness_%'], df['acousticness_%'], df['speechiness_%']]
    #we will also need to give a label to each of these lists
    group_labels = ['Danceability %', 'Energy %', 'Valence %', 'Liveness %', 'Acousticness %', 'Speechiness %']
    #now we will plot the distribution and set show_hist to 'False' to remove the bars
    fig = ff.create_distplot(hist_data, group_labels, show_hist=False)
    #Update titles and formatting
    fig.update_layout(title={'text': "Distribution of Popular Song Attributes in 2023",
                             'xanchor': 'center',
                             'yanchor': 'top',
                             'y':0.9,
                             'x':0.5
                             })
    return fig
curve_and_rug(data)
What do we have here... Our distribution plot reveals that songs with lower speechiness and liveness percentages were popular in 2023. We can also see that high danceability and energy percentages characterized the majority of popular songs that year.
Now we will use Plotly to create a scatter plot of the relationship between danceability percentage and BPM, since high danceability stood out in our previous plot as a marker of song popularity.
def scatter(df):
    #Plot a scatter plot using px.scatter
    fig = px.scatter(df,
                     x='danceability_%',
                     y="bpm",
                     hover_name="track_name", #set the hover values to show names of tracks
                     size="streams", #set the point size to vary by amount of streams
                     color="key", #set the color of points to vary by categorical data key
                     trendline="ols", #add an ordinary least squares regression line (requires the statsmodels package)
                     trendline_scope="overall", #set trendline scope to overall to avoid having a trendline for each key
                     )
    #Update titles and formatting
    fig.update_layout(title={'text': "2023 Popular Song BPM to Danceability",
                             'xanchor': 'center',
                             'yanchor': 'top',
                             'y':0.9,
                             'x':0.5},
                      xaxis_title="Danceability %",
                      yaxis_title="Beats Per Minute"
                      )
    return fig
scatter(data)
Beautiful! This plot shows a negative relationship between BPM and danceability percentage across 2023's most popular songs. Perhaps too high a BPM can hinder one's flow of moves.
Though our previous scatter plot is fun to look at, it is hard to see relationships between keys. We can still see that C# was popular, but F# stands out more because of its color. In general, there is too much categorical noise. Let's add interactivity to remedy this by cycling through a regression plot for each individual key.
def scatter_interactivity(df):
    #Again, plot using px.scatter
    fig = px.scatter(df,
                     x='danceability_%',
                     y="bpm",
                     hover_name="track_name",
                     size="streams",
                     color="key",
                     #This time we will use ordinary least squares regression without setting
                     #the trendline scope to "overall" since we are cycling through the keys
                     trendline="ols",
                     #setting the animation frame to key creates a slider for interactivity
                     #that cycles through the keys
                     animation_frame="key",
                     )
    #Update titles and formatting
    fig.update_layout(title={'text': "2023 Popular Song BPM to Danceability by Key",
                             'xanchor': 'center',
                             'yanchor': 'top',
                             'y':0.9,
                             'x':0.5},
                      xaxis_title="Danceability %",
                      yaxis_title="Beats Per Minute"
                      )
    return fig
scatter_interactivity(data)
Huzzah! Now it is much easier to decipher the relationship between BPM and danceability percentage across keys. While the OLS lines are negative for the most part, viewing them key by key shows that some keys have a more negative relationship while others are closer to neutral. We can even see that the key of F has a positive relationship between BPM and danceability, which could lead to further analysis we could not have conducted otherwise.
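If you would rather compare the keys as numbers than by eye, one possible follow-up is to fit a straight line per key and compare the slopes. The sketch below does this with `numpy.polyfit` on a made-up frame. The `toy` values and the helper `slope` are hypothetical, and this is a separate calculation, not a way of extracting the trendlines Plotly drew.

```python
import numpy as np
import pandas as pd

# Hypothetical songs in two keys, invented for this example
toy = pd.DataFrame({
    "key": ["F", "F", "F", "A", "A", "A"],
    "danceability_%": [50, 60, 70, 50, 60, 70],
    "bpm": [100, 110, 120, 150, 130, 110],
})

def slope(group):
    """Slope of the least squares line of bpm against danceability for one key."""
    return float(np.polyfit(group["danceability_%"], group["bpm"], deg=1)[0])

# One slope per key: positive means BPM rises with danceability, negative means it falls
slopes = {key: slope(group) for key, group in toy.groupby("key")}
print(slopes)
```

With real data, keys that contain only a handful of songs will give unstable slopes, so it is worth checking the group sizes alongside them.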